@kakra kakra commented Dec 14, 2025

Export patch series: https://github.com/kakra/linux/pull/40.patch


btrfs: tiered allocation hints and queue-based read balancing

Special Thanks to @Forza-tng for extensive testing, feedback, and maintaining the documentation guide:
👉 Btrfs Allocator Hints and Read Policies Guide by Forza-tng


This PR introduces a set of patches to improve Btrfs performance and flexibility in heterogeneous storage environments (mixed SSD/HDD, tiered storage, bcache).

1. Allocator Hints (Data Placement)

Allows preferring specific devices for data or metadata allocations. This works by storing a hint in the persistent device item on-disk.

  • Essential for tiered setups: Force metadata onto SSDs (for speed) while keeping bulk data on HDDs.
  • Graceful removal: Mark devices to accept no new allocations, so you can drain them naturally or via balance without racing against new writes.

2. Read Balancing Policies

Extends Btrfs RAID1 read balancing with a dynamic queue policy. The standard PID-based policy is often insufficient for mixed-device pools or high-IOPS workloads.

  • pid: (Default) Static hashing by process ID. Good for simple setups.
  • round-robin: Distributes reads equally. Good for aggregate throughput on identical disks.
  • queue: (Recommended) Routes requests to the device with the fewest in-flight requests (shortest queue).
    • Adapts instantly to device load and speed differences.
    • Avoids "stalling" on busy devices.
    • In benchmarks, this policy consistently delivered the highest IOPS and lowest latency, especially under mixed load.
  • devid: Pin reads to a specific device ID (mostly for testing).

(Note: Previous experimental latency-based policies were dropped in favor of queue due to better stability and lower complexity.)
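The difference between pid and queue can be illustrated with a toy model in Python. This is only a sketch of the selection idea, not the kernel code: the `Mirror` class, the function names, and the in-flight numbers are invented for illustration, and the pid variant assumes a plain modulo over the mirror count.

```python
from dataclasses import dataclass


@dataclass
class Mirror:
    devid: int
    in_flight: int  # hypothetical snapshot of requests currently in flight


def pick_pid(mirrors, pid):
    """Static choice: the process ID alone decides which mirror is read."""
    return mirrors[pid % len(mirrors)]


def pick_queue(mirrors):
    """Dynamic choice: the mirror with the fewest in-flight requests wins."""
    return min(mirrors, key=lambda m: m.in_flight)


mirrors = [Mirror(devid=1, in_flight=12), Mirror(devid=2, in_flight=3)]
print("pid policy reads from devid", pick_pid(mirrors, pid=4242).devid)
print("queue policy reads from devid", pick_queue(mirrors).devid)
```

In this model the pid choice never changes for a given process, even when its device is the busy one, while the queue choice follows the load.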

3. Decoupling from Experimental Status

Important: Upstream Kernels (6.13+) use CONFIG_BTRFS_EXPERIMENTAL to gate various unstable work-in-progress features. To allow using the allocator hints and read policies without enabling potentially unstable upstream code, these features have been moved out of the experimental gate.

Recommendation: Remove the line CONFIG_BTRFS_EXPERIMENTAL from your .config before running make oldconfig. The build system will then prompt you specifically for the new options (CONFIG_BTRFS_ALLOCATOR_HINTS, CONFIG_BTRFS_READ_POLICIES), allowing you to enable them safely without turning on other experimental Btrfs features.
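To see which Btrfs options a kernel config actually contains, the config text (for example the output of `zcat /proc/config.gz`) can be scanned for `CONFIG_BTRFS_*` lines. A minimal Python sketch; the helper name `btrfs_options` is made up for illustration, and the sample text is a hand-written excerpt rather than output from a specific kernel.

```python
import re


def btrfs_options(config_text):
    """Map each CONFIG_BTRFS_* symbol to its value; unset symbols map to None."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        unset = re.fullmatch(r"# (CONFIG_BTRFS_\w+) is not set", line)
        if unset:
            options[unset.group(1)] = None
            continue
        assigned = re.fullmatch(r"(CONFIG_BTRFS_\w+)=(.*)", line)
        if assigned:
            options[assigned.group(1)] = assigned.group(2)
    return options


# Hand-written sample, not taken from a real system.
sample = """\
CONFIG_BTRFS_FS=m
CONFIG_BTRFS_ALLOCATOR_HINTS=y
CONFIG_BTRFS_READ_POLICIES=y
# CONFIG_BTRFS_EXPERIMENTAL is not set
"""
opts = btrfs_options(sample)
for name in ("CONFIG_BTRFS_ALLOCATOR_HINTS", "CONFIG_BTRFS_READ_POLICIES",
             "CONFIG_BTRFS_EXPERIMENTAL"):
    print(name, "->", opts.get(name))
```

This only reports the build configuration; whether the sysfs files exist still has to be checked on the running system.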


Quickstart Guide

Setting Allocator Hints

  1. Enable CONFIG_BTRFS_ALLOCATOR_HINTS in kernel config.
  2. Run btrfs device usage /mnt/path to identify your device IDs.
  3. Set the hint via sysfs (this persists on-disk, no udev rules needed!):
    echo <TYPE> | sudo tee /sys/fs/btrfs/<UUID>/devinfo/<DEVID>/type

Available Types:

  • 0: Prefer data (Default for HDDs).
  • 1: Prefer metadata (Recommended for SSDs/NVMe).
  • 2: Metadata only (Use with caution).
  • 3: Data only (Use with caution).
  • 4: None preferred (Avoids new allocations, useful to drain a drive).
  • Added: 5: None (Strictly prevents ANY new allocation, useful for parallel device remove).

After changing hints, a rebalance of metadata/data is required to move existing extents to their preferred location.
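For scripting, the sysfs write above can be wrapped in a small Python helper. This is a sketch under assumptions: the enum member names are invented (only the numeric values come from the list above), and the demo writes into a scratch directory that stands in for `/sys/fs/btrfs`, so it does not touch a real filesystem.

```python
import tempfile
from enum import IntEnum
from pathlib import Path


class AllocHint(IntEnum):
    # Member names are illustrative; the values follow the list above.
    PREFER_DATA = 0
    PREFER_METADATA = 1
    METADATA_ONLY = 2
    DATA_ONLY = 3
    NONE_PREFERRED = 4
    NONE = 5


def hint_file(uuid, devid, sysfs_root="/sys/fs/btrfs"):
    """Path of the per-device hint file for one filesystem UUID and device ID."""
    return Path(sysfs_root) / uuid / "devinfo" / str(devid) / "type"


def set_hint(uuid, devid, hint, sysfs_root="/sys/fs/btrfs"):
    """Write the numeric hint, like `echo <TYPE> | tee .../type`.

    On a real system this needs root and a kernel carrying these patches.
    AllocHint(hint) raises ValueError for values outside 0..5.
    """
    hint_file(uuid, devid, sysfs_root).write_text(f"{int(AllocHint(hint))}\n")


# Demo against a scratch directory; "demo-uuid" is a placeholder, not a real UUID.
with tempfile.TemporaryDirectory() as root:
    hint_file("demo-uuid", 2, root).parent.mkdir(parents=True)
    set_hint("demo-uuid", 2, AllocHint.PREFER_METADATA, sysfs_root=root)
    print(hint_file("demo-uuid", 2, root).read_text().strip())
```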

Enabling Read Policy

  1. Enable CONFIG_BTRFS_READ_POLICIES in kernel config.
  2. Set boot parameter: btrfs.read_policy=queue
  3. Or switch at runtime: echo queue | sudo tee /sys/fs/btrfs/<UUID>/read_policy

Diagnostic Statistics

Adds per-device read statistics to /sys/fs/btrfs/<UUID>/devinfo/<DEVID>/read_stats.

ios %lu wait %llu avg %llu age %llu ignored %llu
  • ios: Total read I/O count.
  • wait: Total accumulated wait time (ns).
  • avg: Cumulative average read latency (ns).
  • age: "Fairness" counter. Increments when the device is skipped/ignored during selection. Resets to 0 when selected. A constantly high age indicates the device is being avoided by the policy.
  • ignored: Total count of times this device was a candidate but skipped.
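Because the line is a flat sequence of name/value pairs, it is easy to parse for monitoring scripts. A minimal Python sketch; the function name `parse_read_stats` and the sample numbers are invented for illustration, and the parser assumes the file contains exactly the single line shown above.

```python
def parse_read_stats(text):
    """Turn 'ios N wait N avg N age N ignored N' into a dict of integers."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating name/value tokens")
    return {name: int(value) for name, value in zip(tokens[::2], tokens[1::2])}


# Invented sample values, not measured on real hardware.
sample = "ios 1200 wait 96000000 avg 80000 age 0 ignored 15"
stats = parse_read_stats(sample)
print(stats["ios"], "reads,", stats["ignored"], "times skipped")
```

On a patched system the input would come from reading `/sys/fs/btrfs/<UUID>/devinfo/<DEVID>/read_stats`.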

Benchmark Results

The following benchmarks (based on kernel 6.12 LTS) compare the new policies against the defaults. Tests were performed on a mixed HDD RAID10 array with bcache, comparing an idle system vs. a system under heavy background load (defrag).

queue proved to be the superior all-rounder, effectively isolating foreground workloads from background noise.

Scenario: No Background Load

Policy        RandRead 4k QD1 (Lat)   RandRead 4k QD32 (IOPS)   SeqRead 1M (BW)
pid           65 IOPS                 537 IOPS                  261 MiB/s
round-robin   241 IOPS                1180 IOPS                 231 MiB/s
latency-rr*   702 IOPS                2477 IOPS                 240 MiB/s
queue         1181 IOPS               3647 IOPS                 272 MiB/s

Scenario: Heavy Background Load (Defrag)

Policy        RandRead 4k QD1 (Lat)   RandRead 4k QD32 (IOPS)   SeqRead 1M (BW)
pid           ~0 IOPS (Stalled)       505 IOPS                  121 MiB/s
round-robin   38 IOPS                 717 IOPS                  126 MiB/s
latency-rr*   585 IOPS                1562 IOPS                 235 MiB/s
queue         967 IOPS                2437 IOPS                 247 MiB/s

(latency-rr was an experimental hybrid policy used during testing, superseded by queue due to better performance and simplicity)
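To put the tables in relative terms, the ratios can be computed directly from the published figures. The short Python sketch below uses only numbers copied from the two tables (the pid and queue rows); the variable and column names are invented for illustration.

```python
# Figures copied from the benchmark tables: (QD1 IOPS, QD32 IOPS, SeqRead MiB/s).
COLUMNS = ("randread_qd1", "randread_qd32", "seqread_1m")
idle = {"pid": (65, 537, 261), "queue": (1181, 3647, 272)}
loaded_queue = (967, 2437, 247)


def ratios(numerators, denominators):
    """Element-wise numerator / denominator."""
    return tuple(n / d for n, d in zip(numerators, denominators))


speedup_idle = ratios(idle["queue"], idle["pid"])  # queue relative to pid, no load
retained = ratios(loaded_queue, idle["queue"])     # queue under load vs. queue idle

for name, speedup, kept in zip(COLUMNS, speedup_idle, retained):
    print(f"{name}: queue is {speedup:.1f}x pid when idle, "
          f"keeps {kept:.0%} of its idle result under load")
```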


Changes in this version (Kernel 6.18 Port)

  • Ported to Linux 6.18.
  • Refactored Kconfig: Features are individually selectable and moved out of "Experimental".
  • Robustness: Fixed potential NULL pointer dereferences in stats tracking during mount/unmount and race conditions in round-robin calculation.
  • Simplified: Dropped complex EMA-based latency tracking in favor of the robust queue policy.

FAQ: Why is this not upstream?

1. Allocator Hints

The allocator hint patches (originally developed by Goffredo Baroncelli, now maintained here) have been discussed on the mailing list but were not merged for design reasons:

  • Free Space Calculation (df): Btrfs calculates available space assuming any chunk can be allocated on any device (respecting RAID profiles). Restricting allocations via hints makes this calculation unreliable. Tools might report free space while Btrfs returns ENOSPC (No space left on device) because the allowed devices for a specific chunk type are full, even if other devices are empty.
  • Maintenance: The original author ceased updates for newer kernels; this repository bridges that gap.

Compatibility Note: This patch reuses the existing (unused) type field in the device item on disk. It does not change the on-disk format version. Unpatched kernels simply ignore the value, ensuring data remains accessible (though allocation preferences will be lost until booted with a patched kernel).

2. Read Policies (queue)

The new queue policy is an experimental addition in this patch set. It is unlikely to be accepted upstream in its current form due to Layer Violation:

  • The filesystem layer (Btrfs) directly accesses internal block-layer statistics (in-flight queue depth) to make routing decisions. The Linux kernel generally enforces strict separation between these subsystems.
  • However, benchmarks show this cross-layer optimization yields significant performance gains in mixed setups, justifying its inclusion here.

fdmanana and others added 2 commits December 13, 2025 16:25
The following kernel message may be logged if `add_inline_refs()` or
`add_keyed_refs()` block for too long:

> kernel: rcu: INFO: rcu_sched self-detected stall on CPU
> kernel: rcu:         10-....: (2100 ticks this GP) idle=0494/1/0x4000000000000000 softirq=164826140/164826187 fqs=1052
> kernel: rcu:         (t=2100 jiffies g=358306033 q=2241752 ncpus=16)
> kernel: CPU: 10 UID: 0 PID: 1524681 Comm: map_0x178e45670 Not tainted 6.12.21-gentoo #1
> kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> kernel: RIP: 0010:btrfs_get_64+0x65/0x110
> kernel: Code: d3 ed 48 8b 4f 70 48 8b 31 83 e6 40 74 11 0f b6 49 40 41 bc 00 10 00 00 49 d3 e4 49 83 ec 01 4a 8b 5c ed 70 49 21 d4 45 89 c9 <48> 2b 1d 7c 99 09 01 49 01 c1 8b 55 08 49 8d 49 08 44 8b 75 0c 48
> kernel: RSP: 0018:ffffbb7ad531bba0 EFLAGS: 00000202
> kernel: RAX: 0000000000001f15 RBX: fffff437ea382200 RCX: fffff437cb891200
> kernel: RDX: 000001922b68df2a RSI: 0000000000000000 RDI: ffffa434c3e66d20
> kernel: RBP: ffffa434c3e66d20 R08: 000001922b68c000 R09: 0000000000000015
> kernel: R10: 6c0000000000000a R11: 0000000009fe7000 R12: 0000000000000f2a
> kernel: R13: 0000000000000001 R14: ffffa43192e6d230 R15: ffffa43160c4c800
> kernel: FS:  000055d07085e6c0(0000) GS:ffffa4452bc80000(0000) knlGS:0000000000000000
> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> kernel: CR2: 00007fff204ecfc0 CR3: 0000000121a0b000 CR4: 00000000001506f0
> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> kernel: Call Trace:
> kernel:  <IRQ>
> kernel:  ? rcu_dump_cpu_stacks+0xd3/0x100
> kernel:  ? rcu_sched_clock_irq+0x4ff/0x920
> kernel:  ? update_process_times+0x6c/0xa0
> kernel:  ? tick_nohz_handler+0x82/0x110
> kernel:  ? tick_do_update_jiffies64+0xd0/0xd0
> kernel:  ? __hrtimer_run_queues+0x10b/0x190
> kernel:  ? hrtimer_interrupt+0xf1/0x200
> kernel:  ? __sysvec_apic_timer_interrupt+0x44/0x50
> kernel:  ? sysvec_apic_timer_interrupt+0x60/0x80
> kernel:  </IRQ>
> kernel:  <TASK>
> kernel:  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> kernel:  ? btrfs_get_64+0x65/0x110
> kernel:  find_parent_nodes+0x1b84/0x1dc0
> kernel:  btrfs_find_all_leafs+0x31/0xd0
> kernel:  ? queued_write_lock_slowpath+0x30/0x70
> kernel:  iterate_extent_inodes+0x6f/0x370
> kernel:  ? update_share_count+0x60/0x60
> kernel:  ? extent_from_logical+0x139/0x190
> kernel:  ? release_extent_buffer+0x96/0xb0
> kernel:  iterate_inodes_from_logical+0xaa/0xd0
> kernel:  btrfs_ioctl_logical_to_ino+0xaa/0x150
> kernel:  __x64_sys_ioctl+0x84/0xc0
> kernel:  do_syscall_64+0x47/0x100
> kernel:  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> kernel: RIP: 0033:0x55d07617eaaf
> kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> kernel: RSP: 002b:000055d07085bc20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> kernel: RAX: ffffffffffffffda RBX: 000055d0402f8550 RCX: 000055d07617eaaf
> kernel: RDX: 000055d07085bca0 RSI: 00000000c038943b RDI: 0000000000000003
> kernel: RBP: 000055d07085bea0 R08: 00007fee46c84080 R09: 0000000000000000
> kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> kernel: R13: 000055d07085bf80 R14: 000055d07085bf48 R15: 000055d07085c0b0
> kernel:  </TASK>

The RCU stall could be because there's a large number of backrefs for
some extents and we're spending too much time looping over them
without ever yielding the cpu.

Avoid the stall warning by adding `cond_resched()`.

Link: https://lore.kernel.org/linux-btrfs/CAMthOuP_AE9OwiTQCrh7CK73xdTZvHsLTB1JU2WBK6cCc05JYg@mail.gmail.com/T/#md2e3504a1885c63531f8eefc70c94cff571b7a72
Signed-off-by: Kai Krakow <kk@netactive.de>
Signed-off-by: Kai Krakow <kai@kaishome.de>
@kakra kakra mentioned this pull request Dec 14, 2025

CHN-beta commented Dec 16, 2025

I tried this patch with CachyOS kernel 6.18.0. After recompiling and rebooting, /sys/fs/btrfs/<UUID>/devinfo/<DEVID>/type did not exist. I can confirm that the patch is applied, and the corresponding kernel options are set.

$ zcat /proc/config.gz | grep -i btrfs
CONFIG_BTRFS_FS=m
CONFIG_BTRFS_FS_POSIX_ACL=y
# CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
# CONFIG_BTRFS_DEBUG is not set
# CONFIG_BTRFS_ASSERT is not set
CONFIG_BTRFS_ALLOCATOR_HINTS=y
# CONFIG_BTRFS_PER_DEVICE_IO_STATS is not set
CONFIG_BTRFS_READ_POLICIES=y
CONFIG_BTRFS_EXPERIMENTAL=y
$ ls /sys/fs/btrfs/f2dfd4a4-276d-4451-999e-a39457f032b5/devinfo/2
error_stats  fsid  in_fs_metadata  missing  replace_target  scrub_speed_max  writeable

Is there anything I am missing?

@Forza-tng

Thanks for the report. I can confirm the same.


kakra commented Dec 16, 2025

Thanks for the report. Will fix...

kakra and others added 3 commits December 16, 2025 14:49
Add the following flags to give a hint about which chunk should be
allocated on which disk.

The following flags are created:

- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
  preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
  preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
  only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY
  only data chunk allowed
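The meaning of the four flags can be written down as two predicates. This Python sketch models only the flag descriptions above (what a device may hold and what it prefers); it is not the kernel's allocation code, and the enum and function names are invented. Treating an X_ONLY device as also preferring X is an assumption of this sketch.

```python
from enum import Enum


class Hint(Enum):
    PREFERRED_DATA = "preferred_data"
    PREFERRED_METADATA = "preferred_metadata"
    METADATA_ONLY = "metadata_only"
    DATA_ONLY = "data_only"


def allows(hint, chunk):
    """May a device with this hint hold a chunk of type 'data' or 'metadata'?"""
    excluded = {"data": Hint.METADATA_ONLY, "metadata": Hint.DATA_ONLY}
    return hint is not excluded[chunk]


def prefers(hint, chunk):
    """Is the device a preferred target for this chunk type? (assumption: X_ONLY counts)"""
    preferred = {
        "data": (Hint.PREFERRED_DATA, Hint.DATA_ONLY),
        "metadata": (Hint.PREFERRED_METADATA, Hint.METADATA_ONLY),
    }
    return hint in preferred[chunk]


for hint in Hint:
    print(hint.name,
          "data:", allows(hint, "data"), prefers(hint, "data"),
          "metadata:", allows(hint, "metadata"), prefers(hint, "metadata"))
```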

Co-authored-by: Goffredo Baroncelli <kreijack@inwid.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>
Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>
v2: Adds a check to prevent modification while the file system is still mounting.

Todo:

- Transactions should not be triggered from sysfs writes, see:
  https://lore.kernel.org/linux-btrfs/20251213200920.1808679-1-kai@kaishome.de/

Link: #36 (comment)
Reported-by: Eli Venter <eli@genedx.com>
Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>
@kakra kakra force-pushed the rebase-6.18/btrfs-patches branch from 7e81d2c to 8a8411c Compare December 16, 2025 13:57

kakra commented Dec 16, 2025

@CHN-beta @Forza-tng Thanks for reporting and confirming. This was actually a bug I introduced when I made allocator hints configurable via make menuconfig: I used the wrong definition in the C code, so the allocator hints were never compiled in and were not active. This also let two syntax issues slip through my compile tests, which show up now. I never verified if the type fields still existed, I'll keep this in mind for the future. Sorry.

Important: This means that anyone who has used the 6.18 patches until now never had allocator hints enabled. Please use the new patches, then verify that devinfo/*/type exists (it will still have your original type value). If it exists, the patch is working now. But you'll need to check whether your btrfs moved metadata to the slow devices:

# btrfs filesystem usage -T {BTRFS-MOUNT-PATH}
...
                  Data    Metadata System
Id Path           RAID1   RAID1    RAID1    Unallocated Total     Slack
-- -------------- ------- -------- -------- ----------- --------- -------
 1 /dev/bcache2   2.51TiB        -        -     1.12TiB   3.63TiB 3.50KiB
 2 /dev/bcache0   2.52TiB        -        -     1.12TiB   3.63TiB 3.50KiB
 4 /dev/nvme0n1p2       - 86.00GiB 32.00MiB    41.97GiB 128.00GiB       -
 6 /dev/nvme1n1p2       - 86.00GiB 32.00MiB    41.97GiB 128.00GiB       -
 7 /dev/bcache3   2.52TiB        -        -     1.12TiB   3.64TiB       -
 8 /dev/bcache1   2.50TiB        -        -     1.14TiB   3.64TiB       -
-- -------------- ------- -------- -------- ----------- --------- -------
   Total          5.03TiB 86.00GiB 32.00MiB     4.57TiB  14.79TiB 7.00KiB
   Used           4.92TiB 30.35GiB  1.05MiB

If it lists devices with unexpected metadata, take note of the affected device IDs, then run a metadata balance filtered by device ID (separate the IDs with spaces):

for ID in {SLOW_DEV_IDs}; do btrfs balance start -mdevid=$ID --enqueue {BTRFS-MOUNT-PATH}; done

E.g., run for ID in 1 7; do ... if your metadata ended up unwanted on device IDs 1 and 7. Balance will then rewrite all metadata chunks on devices 1 and 7, effectively re-allocating them on the fast dedicated devices, without touching existing metadata chunks on other devices. It should be a fast and safe operation.
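The same loop can be generated from a script, which makes it easy to review the commands before running them. A small Python sketch: the function name `balance_commands` and the mount path `/mnt/pool` are placeholders, and the sketch only prints the commands instead of executing them.

```python
import shlex


def balance_commands(slow_devids, mount_path):
    """Build one 'btrfs balance start -mdevid=<ID> --enqueue <path>' per device ID."""
    return [
        ["btrfs", "balance", "start", f"-mdevid={devid}", "--enqueue", mount_path]
        for devid in slow_devids
    ]


# Device IDs 1 and 7 match the example in the text; "/mnt/pool" is a placeholder.
for argv in balance_commands([1, 7], "/mnt/pool"):
    print(shlex.join(argv))
```

Each printed line can be pasted into a root shell, or the argument lists can be passed to `subprocess.run` once they have been checked.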

Thanks.

@CHN-beta

> I never verified if the type fields still existed, I'll keep this in mind for the future. Sorry.

Please don't worry about it. Everyone makes mistakes sometimes, and this one didn't actually cause any damage. Thank you for your contribution!

kakra and others added 6 commits December 17, 2025 04:19
When this mode is enabled, the chunk allocation policy is modified as
follows:

Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk
  for the X chunk type (the other type may be allowed when the space is
  low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
  This means also that it is a preferred choice.

Each time the allocator allocates a chunk of type X, first it takes the
disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X.

If the space is not enough, it uses also the disks tagged as
ALLOCATION_METADATA_ONLY.

If the space is not enough, it uses also the other disks, with the
exception of the one marked as ALLOCATION_PREFERRED_Y, where Y is the
other type of chunk (i.e. not X).

Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>
This is useful where you want to prevent new allocations of chunks on a
disk which is going to be removed from the pool anyways, e.g. due to
bad blocks or because it's slow.

Signed-off-by: Kai Krakow <kai@kaishome.de>
This is useful where you want to prevent new allocations of chunks to
a set of multiple disks which are going to be removed from the pool.
This acts as a multiple `btrfs dev remove` on steroids that can remove
multiple disks in parallel without moving data to disks which would be
removed in the next round. In such cases, it will avoid moving the
same data multiple times, and thus avoid placing it on potentially bad
disks.

Thanks to @Zygo for the explanation and suggestion.

Link: kdave/btrfs-progs#907 (comment)
Signed-off-by: Kai Krakow <kai@kaishome.de>
This adds read stats per device to devinfo to evaluate the effects of
different read policies better.

This adds a new file /sys/fs/btrfs/BTRFS-UUID/devinfo/ID/read_stats.

Signed-off-by: Kai Krakow <kai@kaishome.de>
Read policies seem safe and stable enough to move them out of the
experimental feature set. This allows us to add more policies without
forcing users to enable the full experimental feature set.

Signed-off-by: Kai Krakow <kai@kaishome.de>
Select the preferred stripe based on the mirror with the least
in-flight requests.

Signed-off-by: Kai Krakow <kai@kaishome.de>
@kakra kakra force-pushed the rebase-6.18/btrfs-patches branch from 8a8411c to 6137992 Compare December 17, 2025 03:40

kakra commented Dec 17, 2025

Updated the branch to improve the style of some if statements.


kakra commented Dec 17, 2025

Added a Github workflow to automatically compile and build-check the patches.

@kakra kakra requested a review from Copilot December 20, 2025 15:53
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
